Reading-Time Annotations for "Balanced Corpus of Contemporary Written Japanese"

نویسندگان

  • Masayuki Asahara
  • Hajime Ono
  • Edson T. Miyamoto
چکیده

The Dundee Eyetracking Corpus contains eyetracking data collected while native speakers of English and French read newspaper editorial articles. Similar resources for other languages are still rare, especially for languages in which words are not overtly delimited with spaces. This is a report on a project to build an eyetracking corpus for Japanese. Measurements were collected while 24 native speakers of Japanese read excerpts from the Balanced Corpus of Contemporary Written Japanese Texts were presented with or without segmentation (i.e. with or without space at the boundaries between bunsetsu segmentations) and with two types of methodologies (eyetracking and self-paced reading presentation). Readers’ background information including vocabulary-size estimation and Japanese reading-span score were also collected. As an example of the possible uses for the corpus, we also report analyses investigating the phenomena of anti-locality.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Between Reading Time and Syntactic/Semantic Categories

This article presents a contrastive analysis between reading time and syntactic/semantic categories in Japanese. We overlaid the reading time annotation of BCCWJ-EyeTrack and a syntactic/semantic category information annotation on the ‘Balanced Corpus of Contemporary Written Japanese’. Statistical analysis based on a mixed linear model showed that verbal phrases tend to have shorter reading tim...

متن کامل

Balanced Corpus of Contemporary Written Japanese

Construction of 100 million words balanced corpus of contemporary written Japanese is underway at the National Institute for Japanese Language. The unique property of the corpus consists in that the majority of its sample texts are selected randomly from well-defined statistical populations covering wide range of written texts.

متن کامل

An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese

Japanese books are usually classified into ten genres by Nippon Decimal Classification (NDC) based on their subject. However, this classification is sometimes insufficient for corpus studies which describe characteristics of the texts in the book. Here, we propose a method of classifying text samples taken from Japanese books into some registers and text types. Firstly, we discuss useful criter...

متن کامل

BCCWJ-DepPara: A Syntactic Annotation Treebank on the 'Balanced Corpus of Contemporary Written Japanese'

Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are separated from and overlaid on bunsetsu(base phrase unit)-based dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward shari...

متن کامل

KOTONOHA and BCCWJ: Development of a Balanced Corpus of Contemporary Written Japanese

The National Institute for Japanese Language (NIJL) has launched a long-term language corpus development initiative aiming at the development of a super-corpus called KOTONOHA, which is consisting of a multitude of independent corpora. Among the constituent corpora of KOTONOHA, the one that bears the most urgent need is a largescale balanced corpus of the present-day written Japanese. Construct...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016